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Tuberculosis (TB) is an infectious disease caused by the bacteria Mycobacterium 
tuberculosis (Mtb) that affects millions of people worldwide. The majority of individuals 
who are exposed to Mtb develop latent infections, in which an immunological response to 
Mtb antigens is present but there is no clinical evidence of disease. Because currently 
available tests cannot differentiate latent individuals who are at low risk from those 
who are highly susceptible to developing active disease, there is considerable interest 
in the identification of diagnostic biomarkers that can predict reactivation of latent TB. We 
present results from our analysis of a controlled longitudinal experiment in which a group 
of rhesus macaques were exposed to a low dose of Mtb to study their progression 
to latent infection or active disease. Subsets of the animals were then euthanized at 
scheduled time points, and granulomas taken from their lungs were assayed for gene 
expression using microarrays. The clinical profiles associated with the animals following 
Mtb exposure revealed considerable variability, and we developed models for the disease 
trajectory for each subject using a Bayesian hierarchical B-spline approach. Disease 
severity estimates were derived from these fitted curves and included as covariates 
in linear models to identify genes significantly associated with disease progression. 
Our results demonstrate that the incorporation of clinical data increases the value of 
information extracted from the expression profiles and contributes to the identification 
of predictive biomarkers for TB susceptibility. 
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1. INTRODUCTION 

Tuberculosis (TB) is an infectious disease caused by the bacteria 
Mycobacterium tuberculosis (Mtb) that affects millions of people 
worldwide. While a small fraction of individuals who are exposed 
to Mtb either completely clear the infection or develop active 
disease, the majority enter a state of latency, in which an immuno- 
logical response to Mtb antigens is present (typically diagnosed 
by a positive response to a skin test) but there is no clinical 
evidence of infection. However, there is not a clear delineation 
between latent and active infection, with levels of bacterial activity 
encompassing a latency spectrum that varies considerably among 
individuals (Barry et al, 2009; Delogu et al, 2013). This consid- 
erably complicates the diagnosis and proactive treatment of TB 
infection, as currently available tests cannot differentiate individ- 
uals who are at low risk of disease progression from those who are 
highly susceptible to developing active disease. For this reason, 
there is much interest in the identification of diagnostic biomark- 
ers that can predict reactivation of latent TB (Wallis et al., 2010; 
Walzl et al, 2011; Ottenhoff et al, 2012). 

One approach for studying the progression of Mtb infection 
from initial exposure to latency is through controlled time-course 



experiments. Non-human primates (NHPs), such as rhesus or 
cynomolgus macaques, display a continuum of TB infection 
which cannot be reproduced in other animal models (Capuano 
et al, 2003; Dutta et al, 2010; Mehra et al., 2010, 2011, 2013) 
and are therefore optimally suited for this type of research. In 
experiments conducted at the Tulane National Primate Research 
Center (TNPRC), a set of 17 rhesus macaques were exposed to 
a low dose of Mtb bacteria and monitored for clinical signs of 
disease. Subsets of the animals were then euthanized at scheduled 
time points, and granulomas taken from their lungs were analyzed 
for gene expression using microarrays. While in an ideal setting 
the combined expression profiles produced from this experiment 
would provide a valuable snapshot into the processes by which 
latency is established at the genomic level, this would require that 
all of our subjects have a common experience of infection. In our 
study, a review of the clinical profiles associated with the animals 
revealed considerable variability in the nature of their respec- 
tive illnesses, with some subjects exhibiting no clinical signs of 
infection and others presenting symptoms associated with severe 
disease. Based on this heterogeneity, we considered that an inte- 
grated analysis of our gene expression data that incorporated 
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each animal's clinical history would provide greater insights into 
the genetic processes associated with the establishment of latent 
infection than one which considered only the length of exposure 
associated with each profile. 

The term clinico- genomic modeling describes a large class of 
methods that seek to integrate data from high-throughput exper- 
iments with clinical information, typically for diagnostic or prog- 
nostic purposes (Gevaert et al., 2006). For example, a recent 
study of primary breast cancer recurrence demonstrated that the 
incorporation of traditional clinical risk factors improved disease 
outcome predictions over statistical classification tree models that 
only used genomic data (Pittman et al., 2004). However, to our 
knowledge ours is the first effort to incorporate longitudinal clin- 
ical data from a controlled experiment in the identification of 
genomic biomarkers from gene expression profiles. 

We first incorporate a number of clinical covariates to model 
disease progression for each subject using a Bayesian hierarchical 
B-spline approach. Our methods are similar to those used in the 
development of longitudinal models for lung function in children 
observed in the presence of varying levels of ambient air pollu- 
tion (Berhane and Molitor, 2008), although in our setting the 
response variable is an unobserved disease severity score rather 
than a measured quantity. From our estimated clinical trajecto- 
ries, we can visualize and quantitatively compare the progression 
from exposure to latent infection in each of our subjects. We then 
incorporate disease severity estimates derived from these curves 
along with temporal information to identify genes that are sig- 
nificantly associated with either or both of these predictors. Our 
results demonstrate that the incorporation of clinical data signif- 
icantly increases the amount of information extracted from the 
expression profiles and is consistent with other recent efforts to 
identify biomarkers for TB susceptibility. 

2. MATERIALS AND METHODS 
2.1. EXPERIMENTAL DESIGN 

Seventeen Indian rhesus macaques were exposed to 100 cfu Mtb 
CDC1551 viaMffo aerosol, a dose which is known to induce latent 
TB based on prior research (Dutta et al., 2010). The animals 
were randomly assigned to time points at 2-week intervals for 
experimental euthanasia, at which point their lungs were biop- 
sied and granuloma lesions were extracted. The experiment was 
conducted using previously established protocols (Dutta et al., 
2010; Mehra et al., 201 1), and all procedures were approved by the 
TNPRC Institutional Animal Care and Use Committee (IACUC) 
and the Tulane University Institutional Biosafety Committee 
(IBC). During the course of the post-exposure period, the sub- 
jects were routinely weighed and monitored for body temper- 
ature at regular intervals, with temperature reported as the 
change from a pre-exposure baseline (denoted by TMP) and 
weight reported as a percentage of the pre-exposure weight 
(denoted by WT). C-reactive protein (CRP), a known indi- 
cator of TB infection, was also monitored, and chest X-rays 
were taken at tri-weekly intervals and scored by a TNPRC 
radiologist on a 0-3 scale based on visual evidence of lesions 
(CXR). Gene expression profiles for extracted granulomas were 
assayed using Agilent TNPRC Macaca mulatta 4 x 44k Arrays 
(GPL10183) and scanned using the GenePix 4000B Scanner. All 



data files are available for download via the GEO Data Series 
GSE56919. 

2.2. HIERARCHICAL B-SPLINE MODELS 

We use piecewise polynomials to estimate the trajectory of 
the unobserved disease progression z(f) as a linear combina- 
tion of a set of B-Spline basis functions (de Boor, 1993). We 
restrict z(t) to the family of non-negative cubic polynomials 
on [0, fi] and [n, x{\, where 0 < X\ < %i, with continuous sec- 
ond derivatives globally on [0, x 2 \. Then z(t) can be expressed 
as z(f) = c 0 B 0 , 3 (f) + ciBi, 3 (f) + c 2 B 2 , 3 (t) + c 3 B 3 , 3 (*) + c 4 B 4 , 3 (r) 
where {en, c\, c%, c 3 , c 4 } is a set of non-negative coefficients 
and {B 0 , 3 (f), B u (f), B 2 , 3 (f), B 3 , 3 (f), B 4 , 3 (f)} is the set of B spline 
basis functions. For each subject, x 2 is the duration of time from 
exposure to euthanasia and is known for each subject, while x\ is 
an interior transition point (or "knot") that is estimated. We use 
the notation z(t\0) where 0 = (co, c\, c 2 , c 3 , c 4 , Ti, r 2 ). 

The four variables WT, TMP, CRP, and CXR were measured 
longitudinally for the n s = 17 experimental subjects. To reduce 
the high degree of fluctuation in CRP measurements, we dis- 
cretized the observations to fall into one of three ordered groups 
based on previous clinical observations: Level 1 for CRP = 0; 
Level 2 for 0<CRP<=10; and Level 3 for CRP > 10. 

Let y v u be the value of the vth variable for the ith subject 
at the time point t v u, where v= 1,2,3,4 (which correspond 
to WT, TMP, CRP, and CXR, respectively), i = 1, 2, • • • , n s and 
/ = 1, 2, • • ■ ,n vl denotes the time points observed for the corre- 
sponding subject and variable. At the initial time oiMtb exposure 
for each subject, we set y ;o = {100, 0, 0, 0}, denoting the baseline 
state for each subject i. Let Y = y v u be the array of all observations 
for all subjects. 

We define the following hierarchical model for the relationship 
between the observed clinical variables and unobserved disease 
state z(t\6): 

For i = 1, 2, . . , , n s : 

yiil = ho + b n z(tiu\0i) + em 

for/= 1,2--- ,«!, (1) 
yiil = ho + h\z(hii\0i) + € 2 n 

for/ = 1,2, ••• ,m 2 , (2) 
logit(P(y 3> , < g)) = fi g + h\z{h n \0i) 

for /= 1, 2, •• • , n 3l andg = 0, 1 (3) 
logit(P(y 4 «7 < h)) = a h + hiz(Uu\0i) 

for /= 1, 2, •• • , n 4l and ft = 0, 1, 2 (4) 

Random errors e v u ~ N(0, 07), v = 1, 2 are independent for each 
subject and variable. Equations (3, 4) are proportional odds mod- 
els (Agresti, 2002) for the discretized CRP y 3i; e {0, 1, 2} and the 
CXR y 4 /; € {0, 1, 2, 3} in which logit(x) = ^ and fi g and ah 
are non-decreasing in g and h. The parameters b v j and are 
common for all subjects while 0,- represents the effects specific to 
subject i, and we assume that the measurements y v n, v = 1, 2, 3, 4 
are conditionally independent given b v j, <7y, and 
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We note that the slopes b v \ and the disease states z(t v u\0,) are 
not individually identifiable in our model since they only appear 
as a product. However, this is simply a matter of scaling and does 
not impact our ability to compare estimated disease trajectories 
across subjects. 

2.2. 1. Parameter estimation 

Let 0; = ( c;o >c,i )Q2 >Cj'3 ,c;4 ,r;i ,r,2 ) be the parameter vector for sub- 
ject i. The disease progression function for the ith subject z;(t) 
is defined on the interval [0, r^], in which tq is set to be the 
final time point, corresponding to the last measurement for any 
of the four clinical covariates. We assume that the pre-exposure 
disease state z(0|#,) = 0, so we set c l0 = 0. Therefore, among the 
components of 0;, we need only estimate Ci\,c&,CQ,cn,Ti\. 

The intercepts feio, £>20> A), Pi, a o, «i, <*2 in Equations (1-4) 
determine the distribution of the corresponding clinical variables 
given the disease state is 0. Since initial values for WT and TMP 
were constant for all subjects at 100 and 0, respectively, the inter- 
cept terms bw, ho were also fixed. Intercepts for the CRP and 
CXR probabilities at disease state 0 were also set to reflect the fact 
that pre-exposure measurements for these variables were fixed at 
0 for all subjects. Values for the vector {/jo, fi\, oto, a\, 0:2} were set 
to be {5, 9, 5, 9, 13}, corresponding to the initial state probabil- 
ity P(CRP = 0) = P(CXP = 0) = ^ffl^) = 0.993 (while these 
were somewhat arbitrary choices, we note that adjustments to 
these initial values were observed to have a negligible impact on 
the fitted models). 

The remaining parameters are {b, a 1 , 9\ , ■ ■ ■ ,6„ s } where b = 
(fen, fe 2 i, hi, hi), o 2 = (erf, c 2 2 ), and 0, = (en, c a , cq, q 4 > tji) 
for i = 1, 2, • • • , n s . Equations (1-4) specify that the WT and 
TMP values ym and yn\ are normally distributed while ym and 
y±H have multinomial distributions. Using the conditional inde- 
pendence of the clinical variables, the joint likelihood function of 
the parameters L(b, a , 0\, ■ ■ ■ , 0„ s \y) is just the product of the 
marginal likelihoods. 

We set prior distributions for these parameters to reflect 
our biological intuition and imposed boundary constraints to 
account for the lack of identifiability associated with our model. 
As we expect that increasing disease severity should be associated 
with weight loss, elevated body temperature, and/or higher lev- 
els of CRP and CXR, we use truncated diffuse normal priors on 
the slopes: b n ~ N(0, 100)Ir_io,o], hi ~ N(0, 100)I [0 ,io], hi ~ 
N{0, 100)I [ _io, 0 ], hi ~ N(0, 100)I[_io, 0 ], where N(fi 0 , ^)I [a ,b] 
is the truncated normal distribution with mean /zn and variance 
§q whose support is the interval [a, fe]. These priors, which cor- 
respond to the scaled normal distribution with support on the 
interval between the mean and one standard deviation to its right 
or left, place higher weight on values close to 0 but are not highly 
restrictive. For the variance parameters, prior distributions were 
motivated by preliminary inspection of the observed variability in 
the data and bounds were set to constrain the estimates to a rea- 
sonable range of values: l/a 2 ~ [/[0.001, 0.6], l/a 2 ~ [7[1, 25]. 
Priors for the basis functions were defined by r,i ~ L/[0, xq\, and 
c,j ~ N(0, 10)/ [0 ,+oo) for; e 1, 2, 3, 4. 

We estimated the parameters by sampling from the posterior 
distribution using MCMC as implemented in WinBUGS (Lunn 
et al, 2000). After a burn-in period of 1000 iterations, we ran 



the MCMC algorithm for 15,000 iterations and thinned the chain 
by taking every third point to produce an effectively uncorrelated 
sample of T = 5000 points from the posterior distribution. 

2.2.2. Simulation study 

To test our method's ability to accurately estimate the progression 
of a specified disease trajectory from observed clinical covari- 
ates, we constructed 4 hypothetical disease scenarios and then 
simulated associated observations for WT, TMP, CRP, and CXR. 

Our simulations represented the following cases: (1) progres- 
sion to severe illness; (2) asymptomatic infection; (3) recovery 
following acute illness, and (4) gradual progression to moder- 
ate illness. Given a specified disease trajectory, we generated 100 
simulated datasets of weekly clinical observations according to 
Equations (1-4). At each time point, WT and TMP observa- 
tions were randomly generated from the normal distributions 
determined by Equations (1, 2) and the categorical CRP and 
CXR observations were taken to be the level with the maxi- 
mum probability determined by Equations (3, 4). We set the 
intercepts feio = 100, ho = 0, /Jo = 5, f}\ = 9, c#o = 5, ot\ = 9, 
CK2 = 13 as above. The slopes were set to be bn = —2, hi = 0.6, 
= —3, hi = —3.5 and the precisions were set to be l/a 1 = 
0.11 and 1 /cr| = 2. These parameters were picked so that the sim- 
ulated values resembled the observed measurements in our TB 
experimental data. 

For each simulated dataset, we sampled from the posterior 
distributions of the parameters (including slopes, precisions and 
subject-specific parameters c 1 i,c,-2,c;3,c,'4,r;i) via MCMC. Due to 
the large number of samples in our study, we reduced our compu- 
tational requirements to include a burn-in period of 50 iterations 
and then we retained every third sample of the following 1000 
iterations. The empirical median of the posterior distribution for 
each parameter was chosen as its estimate, and these were then 
used to generate the estimated trajectories of the disease state for 
the four subjects. 

2.3. LINEAR MODELS FOR GENE EXPRESSION PROFILES 

Following estimation of the disease trajectories for each subject, 
we employed linear models to detect genes that are significantly 
associated with disease features. 

In this setting, we fit gene-wise models as described in Smyth 
(2004), in which the response variables were the normalized log 
expression value associated with gene g for each of the i sub- 
jects and the covariates were derived from the estimated trajectory 
kit). 

Our two-channel microarray data included expression levels 
of 44,449 gene probes for 17 subjects, arranged in a loop design 
that included other control samples. We analyzed the two chan- 
nels individually, normalized the arrays using the Bioconductor 
packages LIMMA and OLIN (Smyth, 2005; Futschik, 2012), and 
extracted the data for the samples of interest. 

To focus our attention on genes that would likely be associated 
with disease progression, we removed probes that showed insuf- 
ficient variability across the subjects. Letting u gl denote the log 2 
expression level for the gth gene probe of the ith subject, we used 
the following filtering criterion: exclude the gth gene probe from 
the analysis if (max, u g \ — min, u gI ) < 1.5. 
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For / clinical features Z\ , Z2 , ■ ■ ■ , Zj as predictors of the 
expression profiles for the G probes remaining after the filtering 
step, we fit models of the form 



: Xp + e g and e g ~ N(0, ajl) 



(5) 



for each g = 1, 2, • • • , G where u g - 
the S x (/+ 1) design matrix X 



(u g i, U g 2, • • • , U gS ) T € 

/ 1 Zll Z21 • • • Z/l \ 

1 Z\2 Z22 ■■■ ZJ2 



and 



\ 1 ZlS Z 2 S ■■■ Zj S J 

P = (|U,g, agi, (Xg2, ■ ■ ■ ,«g/) r . The intercept \i g is the mean 
expression level for the gth probe across the subjects and 
OLg\, a g 2, ■ ■ ■ ,oig] are the slopes of the / clinical features for the 
gth probe. 

We determined that a particular gene probe was significantly 
associated with the covariate combination Z\,Zi,--- ,Zjii the 
p-values of the individual f-tests for the / covariates were all less 
than 0.05 and the p-value of the overall model F test Ho : a ? i = 
a g 2 = ■ ■ ■ oigj = 0 was less than 0.01. Given a set of competing 
models, the model with the smallest overall p-value for which all 
coefficients met the a = 0.05 significance level was selected as the 
best model for that probe. 

Probe-level results were aggregated to the set of unique Agilent 
probe IDs. Because pairs of replicate probes occasionally were 
best fit by models with differing parameterizations, the statisti- 
cal significance of the pooled coefficients was calculated by taking 
the geometric mean of the observed p-values (which were auto- 
matically assigned to the value 1 for any covariates that were not 
included in the best-fitting model for a given probe) and we reset 
any model coefficients that were not associated with aggregate 
p-values of 0.05 or less to be equal to 0. 

Hierarchical clustering was performed in R to classify the 
unique probe IDs with respect to the scaled matrix of fitted model 
coefficients, using the Euclidean distance and Ward's minimum 
variance method. 

2.4. BI0INF0RMATICS ANALYSIS 

Agilent probe IDs that were significantly associated with a fit- 
ted clinical model were mapped to annotated genes or genomic 
regions using the DAVID Gene ID conversion tool (Huang et al., 
2009a,b). Subsequently, the set of all unique mapped gene IDs was 
uploaded to the DAVID suite and analyzed for functional anno- 
tation. Our threshold for statistical significance of enriched terms 
was a Benjamini-Hochberg adjusted p-value of 0.05 or less. 

The set of significantly enriched SP-PIR keywords was ana- 
lyzed to determine whether representation of associated genes was 
equally distributed across the model-derived gene clusters. For 
each keyword, we conducted a Chi-Square goodness of fit test for 
the cluster distributions based on the null hypothesis of no asso- 
ciation and calculated the associated p-values with adjustment for 
multiple testing using the Benjamini-Hochberg correction. 

3. RESULTS 

Data for three of the subjects included in our study were found to 
be anomalous. In one case, terminal illness required immediate 



euthanasia, and in the two others we observed aberrant expres- 
sion profiles suggesting experimental error. Data for these sub- 
jects were discarded and the results presented here correspond to 
the 14 remaining subjects. 

3.1. MODELING OF SIMULATED TRAJECTORIES 

The results from our simulation study are shown in Figure 1. The 
100 estimated trajectories for each of the four cases were com- 
pared to the true disease trajectory. As mentioned previously, the 
disease state is not identifiable and is unique only up to a con- 
stant. Therefore, the comparison was done in relative terms to 
see whether our modeling approach can recover the "shape" of 
the disease state trajectories. Specifically, suppose z,-(f) is the true 
trajectory and z,(t) is any one of the 100 estimated trajectories 
for the z'th case where i = 1, 2, 3, 4. We expect that there exists a 
scaling constant y such that yz,(f) ~ z;(f) for all i, and the right 
panel of Figure 1 displays a plot of the 100 estimated trajectories 
Zf(t) against the scaled true trajectory yz,(f). In each case, our 
simulations faithfully recovered the underlying trajectory, thereby 
demonstrating the ability of our approach to accurately estimate 
model parameters in practice. 

3.2. FITTED TRAJECTORIES FOR CLINICAL PROFILES 

The left panel of Figure 2 displays the observed clinical variables 
for four typical subjects: HC90, HC20, HB74, FR67. The CRP 
used in the plot is the discretized CRP. Hierarchical B-Spline 
models Equations (1-4) were fitted for each of the subjects as 
described. Following MCMC estimation, we obtained empiri- 
cal posterior distributions for the common effects parameters, as 
shown in Figure 3 and summarized in Table 1 . 

While all parameters were statistically significant, estimated 
trajectories, as shown in the right panel of Figure 2, were more 
sensitive to changes in CRP and CXR than to those in TMP and 
WT. Table 2 summarizes certain key features of the fitted trajecto- 
ries for each subject. Time post-exposure T is defined by the time 
from initial exposure to euthanasia, final severity S is the value 
from the fitted disease trajectory at the final time point, onset time 
O is the time when the subject initially reaches a specified sever- 
ity level (which here was taken as 1.4 based on estimated severity 
scores associated with symptomatic illness), and the maximum 
severity M is the highest value attained for a given trajectory. 

3.3. LINEAR MODELS FOR GENE EXPRESSION 

After preliminary filtering, 14,504 candidate gene probes were 
retained for further investigation. Not surprisingly, preliminary 
analysis indicated that the predictors in Table 2 were highly cor- 
related, and so we chose to focus our attention on two of the least 
correlated predictors, time post-exposure T and final severity S. 
For each of these probes, the best-fitting regression model was 
chosen from among models containing all subsets of linear and 
quadratic terms for T as well as a linear term for S. A total of 
9130 probes were significantly associated with at least one of these 
models, and 8453 of these corresponded to annotated genes or 
genomic regions as identified by the DAVID Gene ID conversion 
tool. The probe level results were then aggregated to represent a 
set of 4864 unique Agilent probe IDs. As shown in Table 3, the 
great majority of genes (93%) were significantly associated only 
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subject 1 




FIGURE 1 | Left: An illustration of the simulated clinical observations for WX recovery (subject 3), and gradually progressing infection (subject 4). Right: 
TMR CRR and CXR associated with trajectories representing rapid disease Estimated disease state trajectories for 100 simulated datasets (green curves) 

progression (subject 1), minimal infection (subject 2), infection followed by and the associated scaled true trajectory (black curve) for each of the four cases. 



with post-exposure time X, with the remaining 7% associated 
with either severity S alone or with both S and T. 

Hierarchical clustering was performed on the scaled set of 
estimated model parameters to determine subsets of probe IDs 
that had the most similar characteristics with respect to their fit- 
ted models. Following visual inspection, we determined that 6 



clusters were sufficient to adequately classify the results, as shown 
in Figure 4. 

The largest cluster (which we denote Cluster 1) contained 2165 
probes which mapped to 1950 distinct gene IDs. These expression 
profiles were predominantly characterized by quadratic increases 
in gene expression over time T (with linear increases in the 
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FIGURE 2 | Left: Observed clinical variables for four typical subjects: HC90, HC20, HB74, FR67. CRP values are binned into 3 groups, corresponding to absent, 
low, or high CRP levels. Right: Estimated disease trajectories from Bayesian hierarchical B-Spline models incorporating the four clinical covariates. 



remaining few). Cluster 2 contained 102 probes that displayed a 
parabolic trend, initially increasing and then decreasing in expres- 
sion over time. These mapped to 100 distinct gene IDs. Cluster 
3 (251 probes, 247 unique gene IDs) included probes whose 
expression significantly increased as a function of severity score 
S. Approximately half of these probes were also significantly asso- 
ciated with time T, either as a quadratic or linear term. Cluster 



4 was the second largest (1891 probes, 1818 unique gene IDs) 
and exclusively included probes whose expression decreased as 
a quadratic function of time T. Cluster 5 was the smallest of 
the clusters, containing 74 probes mapping to 73 unique gene 
IDs. Expression profiles for probes in this cluster were all nega- 
tively associated with disease severity S, and the majority of these 
were also significantly associated with time T. Finally, Cluster 6 
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-0.6 -0.4 -0.2 o:o 0.00 0.05 0.10 0.15 0.20 0.25 



CRP CXR 




FIGURE 3 | The empirical densities of posterior distributions of the 
slopes b n (WT), D21 (TMP), fen (CRP), b 41 (CXR). 



Table 1 | Distributions of parameter estimates. 



Parameter 


Mean 


SD 


2.5% 


Median 


97.5% 


£>n (weight) 


-0.1026 


0.0874 


-0.3228 


-0.0804 


-0.0032 


br\ (temperature) 


0.0885 


0.0324 


0.0285 


0.0875 


0.1549 


b 3 i (CRP) 


-1.9971 


0.2141 


-2.4630 


-1.9860 


-1.6160 


b 41 (CXR) 


-2.5587 


0.2746 


-3.1631 


-2.5420 


-2.0809 


Maf (weight) 


0.0437 


0.0043 


0.0357 


0.0436 


0.0525 



1/cr 2 2 (temperature) 1.3532 0.1358 1.0950 1.3465 1.6340 



(381 probes, 373 gene IDs) included probes whose expression 
decreased linearly as a function of T. 

3.4. FUNCTIONAL ASSOCIATIONS 

The set of all unique gene IDs included in the six clusters 
was imported into the DAVID bioinformatics tool suite and 
analyzed for functional annotation. We found that this subset 
was significantly enriched for 181 GO Biological Process (BP) 
terms, 50 Cellular Component (CC) terms, 7 GO Molecular 
Function (MF) terms, 8 KEGG Pathways, 77 Swiss-Prot Protein 
Information Resource (SP-PIR) Keywords, and 5 UniProt 
Sequence Annotation (UP-SEQ) Features. To avoid redundancy, 
we present results for the 56 statistically significant SP-PIR key- 
words that were unambiguously defined in the Swiss-Prot con- 
trolled vocabulary of keywords (www.uniprot.org/docs/keywlist). 
Figure 5 displays these terms along with the relative proportion of 
gene IDs included within each expression profile cluster. Based on 
the observed numbers of gene IDs in each cluster, if the gene IDs 
represented by a given keyword were randomly associated with 
the set of six clusters we would expect the percentage of gene IDs 
by cluster to be distributed as follows: 42.4% in Cluster 1, 2.2% in 



Table 2 | Features of fitted disease trajectories. 



Subject 


Time 


Final 


Onset 


Maximum 




post-exposure 


severity 


time 


severity 


DVZ 1 


15.0 


2.95 


8. 1 


2.97 


CA75 


1 5.1 


1 .96 


12.0 


1 .96 


rb IU 


23.0 


2.20 


5.0 


2.20 


c Inc 
hJUb 


26.0 


2. 10 


22.1 


2.10 


rnb / 


16.1 


0.9b 


1 2.0 


1 .43 


HA77 


12.0 


3.84 


1 .1 


3.86 


HB74 


26.0 


2.06 


6.6 


2.06 


HC20 


21.0 


6.77 


2.9 


6.77 


HC38 


9.0 


6.25 


1.7 


6.59 


HC90 


25.0 


0.76 


1.3 


2.56 


HG80 


5.0 


2.86 


1.4 


2.88 


HJ01 


21.0 


3.91 


3.7 


4.04 


HJ91 


17.0 


2.92 


8.7 


3.05 


HJ93 


9.0 


4.51 


2.9 


4.51 



Table 3 | Best-fitting models for 4864 Unique Agilent Probe IDs. 



Clinical covariates Number of significantly associated genes 



Tj 


367 


Tf 


3959 


T„ Tf 


188 


S; 


145 


Tj, S; 


25 


Tf, Si 


151 


Tj, Tf, Si 


29 



Cluster 2, 5.5% in Cluster 3, 39.7% in Cluster 4, 1.6% in Cluster 
5, and 8.6% in Cluster 6. Chi-Square tests for random association 
of enriched genes with cluster membership identified 24 terms 
that were consistent with the expected cluster distribution, while 
the remaining 32 terms deviated significantly from the expected 
proportions. For a few terms, the deviations reflected an imbal- 
ance of gene IDs associated with quadratic temporal increases 
or decreases (Clusters 2 and 4), which would be expected to be 
observed in nearly equal proportions. For example, for the key- 
word "Hormone" 31 of 33 gene IDs were associated with Cluster 
2 (p = 0.005) while 14 of the 15 gene IDs associated with the 
keyword "Ubiquinone" were contained in Cluster 4. 

Of particular interest were the keywords that exhibited signifi- 
cant over-representation of genes associated with disease severity 
score, represented by Clusters 3 and 5. For the keyword inflam- 
matory response, 6 of the 32 included gene IDs (17.6%) were 
severity-associated (CCL2, CCL11, CCL20, CXCL1, CXCL3, and 
TLR8), and over 10% of the gene IDs mapped to the key- 
words "activator," "chemotaxis ," "chromosomal rearrangement," 
"cytokine," "electron transport," "endosome," and "SH3 domain" 
were included in Clusters 3 or 5. While relatively few genes were 
significantly down-regulated with increased severity, an interest- 
ing inclusion was the chemokine CXCR5. Recent studies have 
demonstrated that CXCR5 activity is essential for TB immune 
response (Gopal et al., 2013; Slight et al., 2013), and our results 
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FIGURE 4 | Hierarchical clustering of unique probe IDs based on fitted 
model coefficients. All coefficients were scaled prior to analysis to assign 
equal weight to each covariate. Red values denote coefficients that are 



positively associated with increases in the given covariates at the a = 0.05 
significance level, while values in blue denote significant negatively 
associated coefficients. 



suggest that expression of this chemokine may be deficient in 
our most symptomatic subjects. CCL2, a chemokine which was 
positively associated with severity in our analysis, has also been 
proposed as a biomarker based on experimental findings demon- 
strating that expression of this gene was elevated for the more 
severe cases within a group of human patients (Hasan et al., 2009; 
Hussain et al, 2011; Ansari et al., 2013). 

To put our results in a larger context, we compared our tran- 
scription profiles with those identified in recent studies seeking 
to identify TB expression biomarkers. In 2010, researchers pub- 
lished a list of 393 transcripts that were found to effectively 
discriminate between cases of active and latent TB in blood 
samples (Berry et al, 2010). We downloaded this list, which 
included 376 distinct gene IDs, and matched it to our set of 
1950 genes that were significantly associated with T and/or S. 
We found an overlap of 106 genes and applied hierarchical clus- 
tering to the expression profiles associated with this subset. The 
results, shown in Figure 6, clearly separated five of our sub- 
jects from the others. Interestingly, these correspond to five of 
the six lowest scoring cases using our fitted severity models, all 
of which had very low CXR scores, 0 CRP values, and low or 
undetectable levels of bacteria upon autopsy. A very low-scoring 
subject on our severity scale that was not included in this group 
was CA75, an animal which demonstrated considerable weight 
loss despite an absence of other clinical symptoms and whose 
expression profile more closely resembled those of some of the 
more symptomatic subjects. Overall, we found that there was 



a significant difference in both time post-exposure and sever- 
ity score between the subjects in the two clusters, although the 
temporal association may be due to the fact that the three ani- 
mals that were studied for over 24 weeks were also coincidentally 
among the least symptomatic cases. We also performed a hierar- 
chical clustering analysis on the set of profiles for 168 temporal - 
and/or severity associated genes included in a list of 409 gene 
IDs identified in a recent aggregate analysis of TB expression 
biomarkers identified in 7 prior studies (Joosten et al., 2013) and 
obtained identical clusters, suggesting that our results are fairly 
robust. 

4. DISCUSSION 

The majority of clinico-genomic modeling efforts to date have 
emphasized the aggregation of clinical and genetic data in the 
prediction of binary disease outcomes. However, for infectious 
diseases such as TB that are associated with a spectrum of con- 
ditions, such an approach is unlikely to illuminate the subtle 
variations in genetic function that might predispose one indi- 
vidual to develop a more severe infection than another. As our 
experimental data clearly demonstrate, the progression from Mtb 
exposure to the development of latent infection is a far from 
uniform process. Even in controlled experiments such as ours, 
reactions vary considerably and the clinical response is difficult 
to predict. Therefore, the analysis of gene expression profiles to 
understand the development of latent infection will be of limited 
value unless such variation is taken into account. 
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FIGURE 5 | Enriched SP-PIR keywords for the set of unique gene distribution of associated genes by model cluster is displayed using 
IDs associated with post-exposure time T and/or final severity colored bars, with Cluster 1 on the left (light blue) and Cluster 6 on 

score S at the a = 0.05 significance level. For each keyword, the the right (yellow). 



Despite the obvious benefits of incorporating clinical data in 
this setting, little work has been done to facilitate such analyses by 
effectively aggregating a set of disparate longitudinal clinical mea- 
surements in a practical and intuitive way. To contribute to this 



important area of research, we have applied Bayesian hierarchi- 
cal B-Spline models to the estimation of disease trajectories. Our 
fitted estimates provide helpful summaries of the clinical pro- 
files of each of our subjects and enable the direct incorporation 
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FIGURE 6 | Heatmap of expression profiles for 106 genes 
significantly associated with post-exposure time T and/or final 
severity score S and included in a set of 393 transcripts that 
were previously reported to effectively discriminate between cases 



of active and latent TB in blood samples (Berry et al., 2010). 

Hierarchical clustering of the samples on the basis of this subset 
separates 5 of the 6 least severe cases in our sample from the 
remaining 9 cases. 



of aspects of the individual disease progressions in a quantitative 
form. Furthermore, the modeling of continuous disease trajec- 
tories offers a great deal of analytical flexibility. While in our 
particular study we concluded that the duration of time post- 
exposure and the estimated severity of disease at the time of 
euthanasia were the most important explanatory factors for vari- 
ation in gene expression, one might imagine that in other settings 
different aspects of an estimated disease trajectory would be more 
predictive. For example, the frequency of bouts of acute illness 
might be more relevant for some conditions, while for others the 
time to recovery might be of particular interest. 

As an alternative to predicting disease outcomes, we have 
focused our attention on the incorporation of clinical profiles 
in the identification of biomarkers associated with observed dis- 
ease severity. Our results demonstrate that, even with a fairly 
limited set of subjects, our approach can identify key genes that 
have been shown to be factors in TB prognosis. This illustrates 
the potential of such integrated analyses for not only TB, but 
for a variety of complicated diseases in which subjects are mon- 
itored over time. While controlled experiments such as ours are, 
of course, limited to the laboratory setting, the ability to incorpo- 
rate longitudinal clinical profiles in the analysis of gene expression 
data from human subjects is certainly an option in many obser- 
vational studies and clinical trials. The estimation of individual 
disease trajectories in such studies would not only enable sig- 
nificant improvements in both the sensitivity and specificity of 
biomarker identification beyond current approaches, but would 
also provide insights into personalized treatment strategies. 
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